Distributed Web-Scale Infrastructure for Crawling, Indexing and Search with Semantic Support
Abstract
In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture, built on top of the MapReduce paradigm, for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, extracting information, lightweight semantic annotation of the extracted information, indexing the extracted information and, finally, indexing documents based on the geo-spatial information found in them. We demonstrate the architecture on two use cases: search in job offers retrieved from the LinkedIn portal and search in BBC news feeds. We also discuss several problems we faced during the implementation, as well as spatial search applications for both use cases, since both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.
Similar resources
Integrating RDF Querying Capabilities into a Distributed Search Infrastructure
The Semantic Web is inherently distributed, and covers both metadata and full-text information. Semantic search therefore can profit a lot from peer-to-peer infrastructures as well as from powerful metadata search functionalities based on full-text search technologies. In this paper we focus on an approach extending an existing P2P search infrastructure with RDF querying capabilities, which bot...
Building the Infrastructure of Resource Sharing: Union Catalogs, Distributed Search, and Cross-Database Linkage
Effective resource sharing presupposes an infrastructure which permits users to locate materials of interest in both print and electronic formats. Two approaches for providing this are union catalogs and Z39.50-based distributed search systems. The advantages and limitations of each approach are considered, paying particular attention to a realistic assessment of Z39.50 implementations. This art...
How to Build Google2Google - An (Incomplete) Recipe
This talk explores aspects relevant for peer-to-peer search infrastructures, which we think are better suited to semantic web search than centralized approaches. It does so in the form of an (incomplete) cookbook recipe, listing necessary ingredients for putting together a distributed search infrastructure. The reader has to be aware, though, that many of these ingredients are research question...
Web-crawling reliability
In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selectiv...
Efficient Proposed Framework for Semantic Search Engine using New Semantic Ranking Algorithm
The amount of information grows by billions of databases every year, and there is an urgent need to search that information with a specialized tool called a search engine. There are many search engines available today, but the main challenge is that most of them cannot retrieve meaningful information intelligently. The semantic web technology is a solution that keeps data...
Journal title:
- Computer Science (AGH)
Volume 13, Issue
Pages -
Publication date: 2012